Module 2 Report

Data2902 Project 1

490283370

University of Sydney | DATA2902 | September 2019

1 Data Overview

1.1 Source

The data we will be analysing is a sample containing answers to questions asked in an online survey by DATA2X02 students. This data was collected through an online form, whose link was given to all students of DATA2X02. The survey contained many questions of various types including multiple-choice, free-form, etc. Let’s begin with reading the dataset and have a look at it using dplr library (A part of tidyverse package).

## Observations: 110
## Variables: 22
## $ Timestamp                                                                             <chr> ...
## $ Identifier                                                                            <chr> ...
## $ Gender                                                                                <chr> ...
## $ `Postcode of where you live during semester`                                          <dbl> ...
## $ `What statistics courses have you taken?`                                             <chr> ...
## $ `How many university clubs are you a member of?`                                      <dbl> ...
## $ `How long has it been since you last went to the dentist?`                            <chr> ...
## $ `On average, how many hours per week did you spend on university work last semester?` <dbl> ...
## $ `What is your favourite social media platform?`                                       <chr> ...
## $ `How many siblings do you have?`                                                      <dbl> ...
## $ `Did you have a pet growing up?`                                                      <chr> ...
## $ `Do you currently live with your parents?`                                            <chr> ...
## $ `How many hours a week do you spend exercising?`                                      <chr> ...
## $ `What is your eye colour?`                                                            <chr> ...
## $ `How many hours per week do you work in paid employment?`                             <dbl> ...
## $ `What is your favourite season of the year?`                                          <chr> ...
## $ `What is your shoe size?`                                                             <dbl> ...
## $ `How tall are you?`                                                                   <dbl> ...
## $ `How often do you floss your teeth?`                                                  <chr> ...
## $ `Do you wear glasses or contacts?`                                                    <chr> ...
## $ `What is your dominant hand?`                                                         <chr> ...
## $ `How do you like your steak cooked?`                                                  <chr> ...

1.2 Variables and Data Cleaning

As we can see, the names of columns are unnecessarily long and not R-friendly. We will use janitor library’s clean_names() and rename() functions to change column names.

We can also see that many variable types are not properly accessed by R. For example variable exercise is accessed as chr but it is actually a numeric type variable.mutate function is used to change variable types:

  • postcode to factor type
  • clubs to integer type
  • dentist to factor type
  • siblings to integer type
  • pet_growing_up to factor type
  • live_with_parents to factor type
  • exercise to numeric type
  • fav_season to factor type
  • floss_frequency to factor
  • glasses to factor type
  • handedness to factor type
  • doneness to factor type

This data needed a lot of cleaning, we will be cleaning specific variables when we analyse them. NA values are not prominent as we can see below (using vis_miss() function):

Most NULL values are present in the identifier(13.64%) and postcode(3.64%) column. Students might have left postcode variable as NULL as they might not be comfortable sharing their location in the survey.

We will handle the NULL values in respective columns when need be.

2 Data Analysis

2.1 Is this sample random?

This sample is not random as:

  • The students who are not very attentive about the UOS probably won’t know about the survey. Thus, this data doesn’t represent all the students but only those who are attentive in class.

  • Some entries are filled unrealistically making the sample less represntative of the population.

  • As Ed platform is not compulsory to be active on, some students might have missed the announcement and hence, not filled the form

2.2 Bias

  • Selection bias: The students filling the form might be only those who are regular with the unit.

  • Measurement bias:
    • Recall bias: A person might not remeber stuff like height and shoe size.
    • Sensitive questions: Students might not answer or answer dishonestly to the questions they find unfavourable,for example, they might be uncomfortable to share their postcode and the number of hours they study .
  • Sampling method: Some students might not check the announcements on Ed platform regularly making them unlikely to fill the survey.

2.3 Variables with bias

  • Study: Only the regular students are probable to fill the survey thus making the sample less representative of the actual population. Also, some students might not have answered the question honestly due to reasons like peer pressure, insecurity, etc…

  • Clubs: Some students who are highly active in many clubs might not be able to attend classes regularly.

  • Hours Employed: Some students who work many hours might not be able to maintain the regularity with the Unit affecting their ability to fill the survey on time.

  • Height and Shoe size: Student might not remember their exact height and would have filled an approximated guess.

  • Postcode: Students might feel insecure in sharing their location.

2.4 Exercise done by vegetarians and non-vegetarian

Having a look at the distribution of the doneness column and representing it clearly using kable() function:

doneness n percent valid_percent
Blue 1 0.9% 0.9%
I don’t eat red meat 7 6.4% 6.4%
Medium 20 18.2% 18.3%
Medium-rare 42 38.2% 38.5%
Medium-well done 23 20.9% 21.1%
Rare 4 3.6% 3.7%
Well done 12 10.9% 11.0%
NA 1 0.9%

We have to categorize all the options under Vegetarian or Non-Vegetarian categories respectively to analyse the data (using mutate() function).

Having a look at the exercise column using kable() function:

diet exercise
Length:109 Min. :0.000e+00
Class :character 1st Qu.:1.000e+00
Mode :character Median :3.000e+00
NA Mean :8.462e+16
NA 3rd Qu.:6.000e+00
NA Max. :9.223e+18

Maximum value seems to be quite unrealistic. So, we have to make some changes to this column to remove unrealistic outliers (for example, exercise more than 24*7 hours a week) which might affect our analysis (using mutate function).

Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in exercising habits of people who are vegetarian and non-vegetarian, let’s first check the normality of these two sets by making boxplots and qqplots using ggplot2 and plotly library.

The box plot for two groups shows that they are not quite symmetric which does not support the normality assumption. There is also an outlier in the non-vegetarian group which further makes the normality assumption worse.

On the right, we can see the qq-plot of the two groups. For the non-vegetarian people the extremes are pretty substantial, which is quite against the normality assumption. Vegetarian people’s qq points seem to roughly lie on the qq line but it might not be correct to assume normality because of less number of people on the Vegetarian side.

As we can see from the above qq plot and the below given density plot, we can assume that they roughly follow the same distribution. The dots in the qq-plot also seem to be in similar shape.

So, the conclusion is that we can not assume normality for this sample. Further, we can roughly assume that they come from the same distribution. Therefore, we can use wilcoxon rank-sum test as it drops the normality assumption.

Specifying the hypothesis properly:

  1. Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the vegetarian population’s (All students of the class) exercising hours and \(\mu_{y}\) is the mean of the non-vegetarian population’s exercising hours.

  2. Assumption: \(X_{i}\) and \(Y_{i}\) are independent and follow the same distribution but differ by a shift, where \(X_{i}\) are all the observations of the exercise done by vegetarian students in the sample and \(Y_{i}\) are that of non-vegetarian students, \(i \in 1,2,\dots,107\)

As all the students filled their answers independently, all observations are independent. Also, from the above discussion and the density plot, we can safely assume that they follow the same distribution possibly with a shift.

A short summary of each group’s exercising hours using summarise() function:

diet n mean sd
Non_Veg 100 4.0125 3.866192
Veg 7 5.0000 4.690416
  1. Test Statistic: \(W = R_{1} + R_{2}+\dots+R_{n_{x}}\). Under \(H_{0}\), \(W\) follows the \(WRS(n_{x}, n_{y})\) distribution, where \(n_{x} = 100\), \(n_{y}=7\) and \(R_{i}\) are the ranks of the first group(\(X_{i}\)).
## [1] 5354.5
  1. Observed Test Statistic: \(t_{0} =\) 5354.5
## 
##  Wilcoxon rank sum test
## 
## data:  exercise by diet
## W = 304.5, p-value = 0.5638
## alternative hypothesis: true location shift is not equal to 0
  1. P-value: Also, we know that if the test statistic w is less than its minimum value i.e. \(n_{x}(n_{x} + 1)/2\), then the p-value is \(2P(W\leq w)\).

w = 5354.5 and \(n_{x}(n_{x} + 1)/2\) = 5050

So,

\(2P(W\leq w) =\) 0.563797

  1. Decision: As the p-value 0.563797 is quite greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the exercising habits of vegetarian and non-vegetarian students.

2.5 Exercise and participation in clubs

Having a look at the distribution of the clubs column using tabyl() function:

clubs n percent
0 41 38.3%
1 22 20.6%
2 16 15.0%
3 18 16.8%
4 3 2.8%
5 5 4.7%
7 1 0.9%
10 1 0.9%

We have to categorize all the options as either Greater than 3 or Less than or equal to 3 to analyse the data.

This time, we don’t need to clean the exercise column as we already did that in previous analysis.

Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in exercising habits of people who involve in more than 3 clubs and who don’t, let’s first check the normality of these two sets using graphical summaries plotted with ggplot2 and plotly library.

The box plot for two groups shows that they are not quite symmetric which does not support the normality assumption. There is also an outlier in the Less than or equal to 3 group which further makes the normality assumption worse.

On the right, we can see the qq-plot of the two groups. For the Less than or equal to 3 group the extremes are pretty substantial, which is quite against the normality assumption. Greater than 3 group’s qq-points seem to roughly lie on the qq-line but it might not be correct to assume normality because of less number of people on Greater than 3 side of the sample.

As we can see from the above qq plot and density plot, we can assume that they roughly follow the same distribution. The dots in the qq-plot also seem to be in similar shape. They do not seem to follow exactly the same distribution, which can be a limitation on using this test.

So, the conclusion is that we can not assume normality of this sample. Further, we can roughly assume that they come from same distribution. Therefore, we can use wilcoxon rank-sum test as it drops the normality assumption.

Specifying the hypothesis properly:

  1. Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the Greater than 3 group’s (All students of the class) exercising hours and \(\mu_{y}\) is the mean of the Less than or equal to 3 group’s exercising hours.

  2. Assumption: \(X_{i}\) and \(Y_{i}\) are independent and follow the same distribution but differ by a shift, where \(X_{i}\) are all the observations of the exercise done by students who involve in more than 3 clubs, in the sample and \(Y_{i}\) are that of students involving in less than or equal to 3 clubs, \(i \in 1,2,\dots,107\)

As all the students filled their answers independently, all observations are independent. Also, from the above discussion and the density plot, we can roughly assume that they follow the same distribution possibly with a shift. This can be a limitation of using this test.

A short summary of each group’s exercising hours using summarize() function:

clubs n mean sd
Greater than 3 10 2.30 2.06
Less than or equal to 3 97 4.26 4.01
  1. Test Statistic: \(W = R_{1} + R_{2}+\dots+R_{n_{x}}\). Under \(H_{0}\), \(W\) follows the \(WRS(n_{x}, n_{y})\) distribution, where \(n_{x} = 10\), \(n_{y}=97\) and \(R_{i}\) are the ranks of the first group(\(X_{i}\)).
## [1] 401.5
  1. Observed Test Statistic: \(t_{0} =\) 401.5
## 
##  Wilcoxon rank sum test
## 
## data:  exercise by clubs
## W = 346.5, p-value = 0.1355
## alternative hypothesis: true location shift is not equal to 0
  1. P-value: Also, we know that if the test statistic w is greater than its minimum value i.e. \(n_{x}(n_{x} + 1)/2\), then the p-value is \(2P(W\geq w)\).

w = 401.5 and \(n_{x}(n_{x} + 1)/2\) = 55

So,

\(2P(W\geq w) =\) 0.1355486

  1. Decision: As the p-value 0.1355486 is greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the exercising habits of students who involve in more than 3 clubs and those who don’t.

2.6 Living with parents vs Studying hours

Having a look at the distribution of the study column:

study n percent valid_percent
167 1 0.9% 0.9%
312 1 0.9% 0.9%

As we can see, there are some unrealistically large values present in the data which will interfere with our analysis in the form of unrealistic outliers. Taking care of these values as follows:

Here, I have assigned all the values greater than 18*7 as unrealistic. I have removed on-average 6 hours per day for sleeping and any other activity, as it might add extra hours of study for humans.

Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in studying habits of people who live with parents and who don’t, let’s first check the normality of these two sets.

The box plot for two groups shows that they are roughly symmetric which supports the normality assumption. But, there is an outlier in the Yes group which may result in it’s rejection.

On the right side, we can see the qq-plot of the two groups. For the No group the extremes are a bit further from the theoritical quantiles, which does not support the normality assumption. Yes group’s qq points seem to roughly lie on the qq line except one outlier but it can be roughlt assumed that they follow normality because of large number of observations on Yes group.

As we can see from the above qq plot and the below given density plot, we can assume that they roughly follow the normal distribution.

Both plots, roughly seem to follow normal distribution, but the No group seem to have bimodality, which might be a limitation while using t-test. They don’t seem to have similar distribution which does not support the use of wilcoxon sign rank test.

So, here we will use permutation test using t-statistic

Specifying the hypothesis properly:

  1. Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the group of students not living with their parents’ study hours and \(\mu_{y}\) is the mean of the study hours of students living with parents.

  2. Assumption: \(X_{i}\) and \(Y_{i}\) are independent and exchangeable, where \(X_{i}\) are all the observations of the study hours of students not living with parents in the sample and \(Y_{i}\) are that of students living with parents, \(i \in 1,2,\dots,52\)

As all the students filled their answers independently, all observations are independent. Also, as they don’t have any relation, their groups can be exchanged.

A short summary of each group’s exercising hours:

live_with_parents n mean sd
No 57 27.70 14.02
Yes 47 29.04 17.13
  1. Test Statistic: \(T = (\overline X - \overline Y)/(s\sqrt n)\). Under \(H_{0}\), not necessarily following \(t_{n-1}\) distribution.
  1. Observed Test Statistic: \(t_{0} =\) -0.4306861
## [1] 0.662
  1. P-value: It will be the probability of getting the observed test statistic less than the array of test statistics we got from permutation.

$p~value = 2P(t_null > t_{0}) = $ 0.662

  1. Decision: As the p-value 0.662 is quite greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the studying habit between students who live with their parents and those who don’t.

Let’s check the result of the permutation t-test with our normal Welch two sample t-test. It might be valid as from the previous discussion we can say that our sample roughly follows normal distribution.

## 
##  Welch Two Sample t-test
## 
## data:  study by live_with_parents
## t = -0.43069, df = 88.615, p-value = 0.6677
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.526971  4.845373
## sample estimates:
##  mean in group No mean in group Yes 
##          27.70175          29.04255

As we can see, that Welch-two sample test gives us the p_value of 0.6677424, which is quite similar to what we got from the permutation test. This further solidifies our decision from the permutation test that, we don’t have enough evidence to reject the null hypothesis.

2.7 Growing up with pet and Living with parents

Here, we will try to find any relation between currently living with parents and growing up with a pet.

Let’s have a look at the contingency table:

##   live_with_parents pet_growing_up Freq
## 1                No             No   34
## 2               Yes             No   13
## 3                No            Yes   25
## 4               Yes            Yes   35

Visualizing our table using ggplot2 and plotly library:

Just a rough look suggests a relation between these two variables, but we have to prove it statistically. Here, we will use ‘Test for Independence’ to check the independence between them.

  1. Hypothesis: \(H_{0}: p_{ij} = p_{∙j}p_{i∙}~, i=1,2;j=1,2,3,4\) vs \(H_{1}:\)Not all equalities hold

  2. Assumption: \(e_{i} = y_{∙j}y_{i∙}/n\ge 5\).

Making a matrix with same values for easier and effective calculations:

Checking assumptions:

## [1] TRUE

The assumption seems to be true as all the expected values are greater than 5.

  1. Test Statistic: \(T = \sum_{i=1}^{r}\sum_{j=1}^c\frac{(y_{ij} - e_{ij})^2}{e_{ij}}\) Under \(H_{0}\), \(T\sim\chi_{(2-1)(2-1)}^2\) approx. The degree of freedom was calculated with the formula \(df = (r-1)(c-1)\), where r is number of rows and c is number of columns.

Using chisq.test function for the Pearson’s Chi-squared test:

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  y.mat
## X-squared = 8.823, df = 1, p-value = 0.002974
  1. Observed Test Statistic: \(t_{0} = 8.823\)

  2. P-value: \(P(\chi_{1}^2 \geq 8.823) = 0.002974\)

  3. Decision: Since the p-value (0.002974) is a quite smaller than 0.05, it is concluded that there is enough evidence to reject the null hypothesis and this data set is not consistent with the null hypothesis stating that living with parents and growing up with parents are independent of each other.

3 DATA2902 Question

3.1 Hotelling-T test

Let’s assume that another australian university claims to have the means of study hours, employment hours and height to be 20 hours, 10 hours and 180 cm respectively. To check if their means are the same as ours, let us perform a multivariate mean test using Hotelling’s T-test.

Cleaning the employment hours, studying hours and height column i.e. removing some unnecessary outliers. Again considering only 18 hours out of 24 hours as productive, removing sleep and other activities (using mutate function).

Considering visual aid (Boxplot and QQplot) for normality assumption using ggplot2 and plotly libraries:

  • Studying hours
  • Employment hours:
  • Height:

As, we can see from above, Study hours has 2 outliers and a roughly symmetric boxplot. On extremes, there is a deviation from the theorotical quantiles. In the employment hours boxplot, there is limited symmetry, it’s qq-plot also shows huge variation. Height plots show much better symmetry and less variation in qq-plot. So, assuming the normality of these sets, might show deviation from actual results. This can be a huge limitation for hotelling’s t test as it assumes multivariate normality.

Specifying the hypothesis properly:

  1. Hypothesis: \(H_{0}: \boldsymbol \mu~=\boldsymbol \mu_{0}\) vs \(H_{1}: \boldsymbol \mu\not=\boldsymbol \mu_{0}\), where \(\boldsymbol \mu\) is the array containing mean values from our sample and \(\boldsymbol \mu_{0}\) is the array containing mean values from the other university.

  2. Assumption: \(\boldsymbol X_{1},\) \(\boldsymbol X_{2},\) \(\dots,\) \(\boldsymbol X_{p}\) are independent and follow multivariate normal distribution \(N_{p}(\boldsymbol \mu, \boldsymbol \sum)\) where \(\boldsymbol X_{i}\) are all the observations from our sample and \(p\) is the number of constraints checking simultaneously, valued at 3.

As all the students filled their answers independently, all observations are independent. Also, we are assuming that they follow multivariate normal distribution as discussed above.

  1. Test Statistic: \(T^2=\) \(n\) \((\) \(\overline X - \boldsymbol \mu_{0}\) \()^T\) \(\boldsymbol S^-1\) \((\) \(\overline X\) \(-\) \(\boldsymbol \mu_{0}\) \()\). Under \(H_{0}\), \(T^2 \tilde~F_{p, n-p}\) as n is large.
  1. Observed Test Statistic: \(t_{0} =\) 83.6845157
## 
##  Hotelling's one sample T2-test
## 
## data:  hotelling_sample
## T.2 = 27.342, df1 = 3, df2 = 99, p-value = 5.779e-13
## alternative hypothesis: true location is not equal to c(20,10,180)
  1. P-value: \(p~value = P(F_{3,99} \geq t_{0}) =\) 5.778710810^{-13}

  2. Decision: As the p-value (<0.0001) is quite smaller than 0.05, we have enough evidence to reject \(H_{0}\). So, the data is not consistent with the null hypothesis that the mean of study hours, employment hours and height of other australian university is same as our class of DATA2X02.

4 Conclusion

Various biases discussed above like Measurement bias, Selection bias and Sampling method are the liimitations in this analysis as they prevent the sample to be a proper representative of the concerned population. Many tests performed above are based on assumptions and rough approximation making them prone to errors.

5 References